Structure of Dataset

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The dataset has 1599 rows and 13 columns.
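As a minimal sketch of how the row and column counts can be checked, the snippet below rebuilds a small data frame from the first six values of a few columns shown in the structure output. The full data file is not loaded here, and `features` is a hypothetical name. It also shows that X simply repeats the row number, so it is an index rather than a measured feature.

```r
# Toy subset: first six rows of five columns, copied from the structure output.
red_wine <- data.frame(
  X                = 1:6,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Rows and columns of the toy subset (the full dataset reports 1599 and 13).
dims <- dim(red_wine)

# X equals the row number, so it carries no information about the wine.
x_is_index <- identical(red_wine$X, seq_len(nrow(red_wine)))

# Drop the index column to keep only the measured features.
features <- red_wine[, names(red_wine) != "X"]

print(dims)
print(x_is_index)
print(names(features))
```

Applied to the full data frame, the same steps would leave 12 feature columns.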

Names of variables

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Head of Dataset

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

Summary

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Our target variable is quality, which has a minimum of 3.000, a maximum of 8.000 and a mean of 5.636.

Univariate Plots Section

Now we will explore each feature independently.

Quality

Quality is our target variable.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

quality ranges from 3 to 8, with a median of 6 and a mean of 5.636.

Let’s draw a histogram of quality

Most of the wines have a quality of either 5 or 6. Very few wines have a quality of 3 or 8. What percentage of wines have a quality of 5 or 6?

## [1] 0.8248906

82.5% of all wines have a quality of 5 or 6.
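One way to compute such a share is to take the mean of a logical vector. The sketch below uses only the first ten quality values from the structure output, so its result differs from the full-data figure; `share_5_or_6` is a hypothetical name.

```r
# First ten quality scores, copied from the structure output (toy data).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# %in% gives TRUE/FALSE per wine; the mean of a logical vector is the share of TRUEs.
share_5_or_6 <- mean(quality %in% c(5, 6))

print(share_5_or_6)
```

On the full data frame the same expression would be applied to the quality column.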

citric.acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

We can’t clearly see the pattern here. Let’s set the binwidth explicitly to see the distribution more clearly.

There is a big spike at citric.acid = 0.00. In total there are three major peaks, at 0.00, 0.25 and 0.50. After 0.50 the count starts to fall. How many wines have citric.acid = 0?

## [1] 132

There are 132 wines that have 0 citric acid.
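A count like this can be obtained by summing a logical comparison. The sketch below assumes only the first ten citric.acid values from the structure output, so it does not reproduce the count for the full dataset; `n_zero` is a hypothetical name.

```r
# First ten citric.acid values, copied from the structure output (toy data).
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Each TRUE counts as 1, so the sum is the number of wines with no citric acid.
n_zero <- sum(citric.acid == 0)

print(n_zero)
```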

residual.sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

residual.sugar has extreme outliers. Let’s use a boxplot to explore this feature further.

The mean of residual.sugar is 2.539, but the maximum is 15.500. Most of the values are below 4.000.
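The points a boxplot draws as outliers can also be listed numerically. The sketch below uses base R's `boxplot.stats()` with its default whisker rule (1.5 times the hinge spread) on the first ten residual.sugar values from the structure output; this is toy data, and the original analysis may have used different plotting code.

```r
# First ten residual.sugar values, copied from the structure output (toy data).
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# boxplot.stats() returns the five-number summary ($stats) and the
# values lying beyond the whiskers ($out).
bs <- boxplot.stats(residual.sugar)

print(bs$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)    # values flagged as outliers
```

In this subset only the value 6.1 lies beyond the upper whisker.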

total.sulfur.dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The mean is 46.47 and the maximum is 289.00.

This distribution is positively skewed. Most of the values are less than 160. Let’s apply a log transform.

After the log transform, the distribution appears approximately normal.
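To see why a log transform helps here, the sketch below compares the distance from the median to each extreme before and after taking log10. It uses only the five-number summary of total.sulfur.dioxide printed above; the helper name `tail_ratio` is hypothetical, and the base of the logarithm is an assumption.

```r
# Min, 1st quartile, median, 3rd quartile and max from the summary output.
five_num <- c(min = 6, q1 = 22, median = 38, q3 = 62, max = 289)

# Ratio of the upper tail length to the lower tail length around the median.
# A value near 1 indicates a roughly symmetric spread.
tail_ratio <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

raw_ratio <- tail_ratio(five_num)
log_ratio <- tail_ratio(log10(five_num))

print(raw_ratio)
print(log_ratio)
```

On the raw scale the upper tail is several times longer than the lower one, while on the log scale the two are close to equal, which is consistent with the more symmetric histogram.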

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The histogram shows that pH is approximately normally distributed around a mean of 3.311, with most values between 3.0 and 3.6.

sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

There are some high values. Let’s trim the x axis.

Now the histogram looks approximately normal. Most of the values are between 0.35 and 0.95.
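The analysis notes later that coord_cartesian was used to limit the x axis. The sketch below, which assumes the ggplot2 package is installed, contrasts that approach with filtering the data. The axis limits, binwidth and the names `toy`, `p` and `trimmed` are assumptions for illustration. The toy data are the first ten sulphates values from the structure output plus the dataset maximum of 2.0.

```r
library(ggplot2)

# First ten sulphates values plus the maximum (2.0) as a stand-in for the high values.
toy <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80, 2.00)
)

# coord_cartesian() zooms the view only: every row is still binned,
# so the bars inside the window are unchanged by the limits.
p <- ggplot(toy, aes(x = sulphates)) +
  geom_histogram(binwidth = 0.05) +
  coord_cartesian(xlim = c(0.3, 1.0))

# Filtering, by contrast, removes rows before anything is computed.
trimmed <- subset(toy, sulphates <= 1.0)

print(nrow(toy))
print(nrow(trimmed))
```

Printing `p` draws the zoomed histogram. Zooming is usually the safer choice when the goal is only to make the bulk of the distribution readable.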

alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Let’s change the binwidth to get more clarity.

The distribution is positively skewed. Let’s apply a log transform.

The plot shows that the peak is at 9.75.

Univariate Analysis

What is the structure of your dataset?

A: There are 1599 observations with 12 features (13 columns including the row index X): “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol” and “quality”. quality is scored from 0 (worst) to 10 (best); in this dataset it ranges from 3 to 8, and most of the values are between 5 and 7.

What is/are the main feature(s) of interest in your dataset?

A: The quality of a wine is judged on its smell, taste and color, so we can roughly say that citric.acid, sulphates and alcohol influence the quality of wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

A: I think acidity (fixed and volatile) and chlorides may also contribute to the quality of wine.

Did you create any new variables from existing variables in the dataset?

A: No. I didn’t create any new variables in the dataset.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

A: I used coord_cartesian to limit the x axis for sulphates because there are some extreme outliers. The positively skewed histograms of total.sulfur.dioxide and alcohol were log transformed.

Bivariate Plots Section

red_wine$quality.factor <- as.factor(red_wine$quality)

I am going to plot the correlations of all features using ggpairs. For convenience, I do this in two steps.

Now we explore the relationship between quality and the other features more closely. In these plots I also mark the mean value inside each boxplot, which gives us an additional statistic to compare across quality levels.
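The per-quality means marked on the boxplots can be computed directly by grouping on the quality factor. The sketch below uses the first ten alcohol and quality values from the structure output as toy data; `toy` and `mean_alcohol` are hypothetical names.

```r
# First ten rows of alcohol and quality, copied from the structure output.
toy <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Treat quality as a categorical grouping variable, as in the analysis.
toy$quality.factor <- as.factor(toy$quality)

# Mean alcohol within each quality level.
mean_alcohol <- tapply(toy$alcohol, toy$quality.factor, mean)

print(mean_alcohol)
```

With so few rows the group means are not meaningful in themselves; the point is only the grouping step that underlies each boxplot.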

quality vs fixed.acidity

The figure shows a complex relationship between quality and fixed.acidity. There doesn’t seem to be a significant pattern between the two.

quality vs volatile.acidity

It seems that high quality wines have lower volatile.acidity levels

quality vs citric.acid

The plot shows that high quality wines tend to have high citric acid. But there are some outliers: some wines of quality 7 have approximately 0.00 citric.acid, and one wine with citric.acid of 1.00 has quality 4.

quality vs residual.sugar

There doesn’t seem to be a relationship between quality and residual.sugar

quality vs chlorides

Overall there is not much difference, but high quality wines contain fewer chlorides and low quality wines contain more.

quality vs free.sulfur.dioxide

We can’t infer anything about the relationship between these two.

quality vs total.sulfur.dioxide

As above, we can’t infer anything from the figure.

quality vs density

The plot shows that high quality wines tend to have low density.

quality vs pH

There doesn’t seem to be any significant difference in mean pH across quality levels, so pH does not appear to be an important factor for quality.

quality vs sulphates

High quality wines tend to have more sulphates.

quality vs alcohol

High quality wines have higher alcohol levels.

Density Plots

We can observe in the density plots that wines of medium quality 5 more often fall into the range of low citric.acid, low sulphates and low alcohol than wines of quality 4 or 3. Red wines of quality 5 also occur quite often at high volatile acidity.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

A: Plots of quality against the different features of red wine show that volatile.acidity, citric.acid, sulphates and alcohol are the features most clearly related to quality. Lower volatile.acidity, higher citric.acid, higher sulphates and higher alcohol are associated with higher quality. The plots also show that fixed.acidity, residual.sugar and pH are not important factors for wine quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

A: According to the scatter plot matrix of all features, fixed.acidity and citric.acid have a high positive correlation, whereas volatile.acidity and citric.acid have a high negative correlation. Even though free and total sulfur dioxide and sulphates all contain sulfur, there is no correlation between them, and their effects on red wine quality differ.

What was the strongest relationship you found?

A: The correlation coefficients between quality and volatile.acidity, citric.acid, sulphates and alcohol are -0.3906, 0.2264, 0.2514 and 0.4762 respectively. The strongest relationship is between alcohol and quality.
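Coefficients of this kind can be computed with base R's `cor()`, which defaults to the Pearson correlation (whether the original analysis used that default is an assumption). The sketch below uses the first ten rows from the structure output, so its values will not match the full-data coefficients quoted above; `toy` and `r_quality` are hypothetical names.

```r
# First ten rows of four features and quality, copied from the structure output.
toy <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

features <- setdiff(names(toy), "quality")

# Pearson correlation of each feature with quality.
r_quality <- sapply(toy[features], cor, y = toy$quality)

# The feature with the largest absolute correlation is the strongest linear relationship.
strongest <- names(which.max(abs(r_quality)))

print(round(r_quality, 4))
print(strongest)
```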

Multivariate Plots Section

In the Bivariate Plots section, we found that the four important features that influence quality are volatile.acidity, citric.acid, sulphates and alcohol. In this section we will explore the combined effect of these features on quality.

alcohol with other properties

The plots above show that high alcohol and high sulphates both contribute strongly to high wine quality.

volatile.acidity with other properties

The plots above show that wines with high volatile acidity tend to have low quality, but for wines with citric acid between 0.75 and 1.0 the effect is insignificant.

sulphates with other properties

The plots above show that high sulphates go with high quality across the other features, with one exception: when citric acid is in the interval (0.75, 1), an increase in sulphates is associated with a drop in quality. The increase in quality with sulphates is largest when alcohol is in the range (11.4, 12.9).
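Intervals like these can be produced by binning a continuous feature with `cut()`. The sketch below assumes the sulphates bins were taken at the quartiles printed in the summary output (0.33, 0.55, 0.62, 0.73, 2), which matches the interval labels used in this report but is not confirmed by it. The data are the first ten sulphates values from the structure output, and `sulphates.bucket` is a hypothetical name.

```r
# First ten sulphates values, copied from the structure output (toy data).
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80)

# Break points: min, quartiles and max of sulphates from the summary output.
breaks <- c(0.33, 0.55, 0.62, 0.73, 2)

# include.lowest = TRUE keeps the minimum value inside the first bin.
sulphates.bucket <- cut(sulphates, breaks = breaks, include.lowest = TRUE)

print(table(sulphates.bucket))
```

The resulting factor can then be used for faceting or coloring, so that quality is compared within each sulphates range.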

We can build a linear model to predict the quality of wine using the above features.

## 
## Call:
## lm(formula = quality ~ alcohol + sulphates + volatile.acidity + 
##     citric.acid, data = red_wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71408 -0.38590 -0.06402  0.46657  2.20393 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.64592    0.20106  13.160  < 2e-16 ***
## alcohol           0.30908    0.01581  19.553  < 2e-16 ***
## sulphates         0.69552    0.10311   6.746 2.12e-11 ***
## volatile.acidity -1.26506    0.11266 -11.229  < 2e-16 ***
## citric.acid      -0.07913    0.10381  -0.762    0.446    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6588 on 1594 degrees of freedom
## Multiple R-squared:  0.3361, Adjusted R-squared:  0.3345 
## F-statistic: 201.8 on 4 and 1594 DF,  p-value: < 2.2e-16

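Linear regression is used to model the relationship between quality and the four features. The intercept is 2.64592, and the coefficients for alcohol, sulphates, volatile.acidity and citric.acid are 0.30908, 0.69552, -1.26506 and -0.07913 respectively. The first three coefficients are highly significant, while the citric.acid coefficient is not (p = 0.446). The model has an R-squared of 0.3361.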
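To make the fitted equation concrete, the sketch below plugs the coefficients from the regression output into a small helper and predicts the quality of the first wine in the dataset. The function name `predict_quality` is hypothetical; with the fitted model object available, `predict()` would normally be used instead.

```r
# Coefficients copied from the regression output above.
coefs <- c(
  intercept        =  2.64592,
  alcohol          =  0.30908,
  sulphates        =  0.69552,
  volatile.acidity = -1.26506,
  citric.acid      = -0.07913
)

# Linear predictor: intercept plus each coefficient times its feature value.
predict_quality <- function(alcohol, sulphates, volatile.acidity, citric.acid) {
  coefs[["intercept"]] +
    coefs[["alcohol"]] * alcohol +
    coefs[["sulphates"]] * sulphates +
    coefs[["volatile.acidity"]] * volatile.acidity +
    coefs[["citric.acid"]] * citric.acid
}

# Feature values of the first row shown in the head output (observed quality: 5).
pred <- predict_quality(alcohol = 9.4, sulphates = 0.56,
                        volatile.acidity = 0.70, citric.acid = 0)

print(round(pred, 3))
```

For this wine the prediction is about 5.06 against an observed quality of 5. Note that predictions are continuous, whereas the quality scores are integers.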

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

A: High alcohol contributes to high quality wine, and higher sulphates raise quality further. Low volatile acidity also contributes to high quality wine; the other features seem to affect quality only when volatile acidity is low. Sulphates contribute positively to quality, but in combination with other features there are some exceptions, such as when alcohol is between 11.4 and 12.9 and citric acid is between 0.75 and 1.

Were there any interesting or surprising interactions between features?

A: Individually, alcohol, volatile acidity, citric acid and sulphates each contribute to quality, but in combination they do not always behave as expected.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

A: Yes, I created a linear model to predict quality from alcohol, sulphates, volatile.acidity and citric.acid, the four features that most influence quality. However, this is a pretty basic model: its R-squared is only 0.3361, so it explains about a third of the variance in quality. The model needs to be revised before it can predict quality accurately.

Final Plots and Summary

Plot One

Description One

A: Quality is our target feature. This histogram shows that the majority of wines have a quality of either 5 or 6. To be precise, 82.5% of all wines have a quality of 5 or 6.

Plot Two

Description Two

A: The plot shows boxplots of quality against the four important features that influence it: volatile.acidity, citric.acid, sulphates and alcohol. Of these, volatile.acidity influences quality negatively, whereas the other three influence it positively. Alcohol has the strongest correlation with quality: higher quality wines tend to have higher alcohol.

Plot Three

Description Three

A: There is a strong relationship between quality and alcohol, and higher sulphates also influence quality positively. When sulphates are in (0.73, 2), quality is high even when alcohol is between 11 and 13. Most of the wines with quality below 5 have sulphates in either (0.33, 0.55) or (0.55, 0.62). Even among wines with the same alcohol percentage, higher sulphates go with noticeably higher quality.

Reflection

The red wine dataset contains 1599 observations, each with 12 features. Our target variable is quality, which has a mean of 5.636. First I plotted histograms of the features and observed that most of the wines have a quality of either 5 or 6. After that I plotted scatter plots of all variables. quality is positively correlated with alcohol, sulphates and citric.acid and negatively correlated with volatile.acidity. I plotted boxplots of quality against all other features to explore these relationships. To see how the important features interact with each other, I drew multivariate plots after dividing the features into intervals. alcohol and sulphates both influence the quality of wine, but the other features show mixed results when combined with each other.

I created a linear model to predict the quality of wine from four important features: alcohol, sulphates, citric.acid and volatile.acidity. The model can be revised. The dataset is quite limited, and more data would likely improve the accuracy of the model. Other models, such as decision trees, random forests and boosting models, might also fit the data better.